Assignment 2


1. Reading the Data

read.delim() is used to read tab seperated data. Therafter, the required columns are filtered and first column is used as a rownames. The dataset is scaled to make it ready for plotting heatmap.

library(plotly)
library(seriation)
library(GGally)
library(tidyverse)


pricesnearn <-read.delim("prices-and-earnings.txt",stringsAsFactors = FALSE)
pricesnearn <- pricesnearn[,c(1,2,5,6,7,9,10,16,17,18,19)]

row.names(pricesnearn) <- pricesnearn$City
scaledprices <- scale(pricesnearn[, c(-1)])


2. Heatmap

Without reordering, very small cluster is to be seen and difficult to see outliers.

m <- list(
  l = 10,
  r = 10,
  b = 50,
  pad = 2
)

heatmapp<-function(df,str) {
plot_ly(width=900, height=1200, 
    x = colnames(df),
    y = rownames(df),
    z = df,
    type = "heatmap",
    colors = colorRamp(c("yellow", "red")),
    colorbar = list(title = "Color-Scale")
) %>% layout(title=str,margin=m)
}

heatmapp(scaledprices,"Heatmap of Countries")


3. HeatMaps and Analysis

On comparing two heatmap generated based on Euclidean Distance and 1-Corr which compute orders that optimizes Hamiltonian Path Length and use Hierarchical Clustering (HC) as the optimization algorithm, heatmap generated by Euclidean distance seems to have better clustering. Even though the permutation of rows is same in both method but there is difference in how columns are arranged. Using Euclidean method, more similar clusters are visible between countries because the gradient of color is less.

Using 1- Corr the similar clusters are visible for Bread, Big Mac and iphone prices among various countries around central left part of the map while with Euclidean distance more similar clusters are visible around top 1/4 of the map, on left central and bottom right of the map.

dist_cols<-dist(t(scaledprices),method="euclidean",diag=FALSE)
dist_rows<-dist(scaledprices,method="euclidean",diag=FALSE)

    
order1_euc<-seriate(dist_rows,"OLO")
order2<-seriate(dist_cols,"OLO")


ord1<-get_order(order1_euc)
ord2<-get_order(order2)

scaledprices_euc <- scaledprices[rev(ord1),ord2]

heatmapp(scaledprices_euc,"Heatmap using Euclidean Distances")
corr_rows<-1-cor(t(scaledprices))
corr_cols<-1-cor(scaledprices)

order1<-seriate(as.dist(corr_rows),"OLO")
order2<-seriate(as.dist(corr_cols),"OLO")
ord1<-get_order(order1)
ord2<-get_order(order2)


scaledprices_corr <- scaledprices[ord1,ord2]

heatmapp(scaledprices_corr,"Heatmap using 1-Correlation as Distances")


4. TSP using Hamiltonian Path Length

Heatmap of the dataset is generated by permutation of rows and columns using Hamiltonian Path Length which uses TSP as a optimizer. The rows and column in this heatpmap are arranged differently from heatmap generated by HC. Additonally, the clusters arranged by TSP are more similar than HC which is visually evident.

order1_tsp<-seriate(dist_rows,"TSP")
order2<-seriate(dist_cols,"TSP")
ord1<-get_order(order1_tsp)
ord2<-get_order(order2)
TSP_solv<-criterion(dist_rows,order1,method=c("Path_Length","Gradient_raw"))

scaledprices_euc <- scaledprices[rev(ord1),ord2]

heatmapp(scaledprices_euc,"Heatmap-TSP using Hamiltonian Path Length")


Here we compare how efficient is TSP against Hierarchical clustering as an optimizer for clustering on basis of different loss/merit function.

  • Path length using TSP solver was lesser than HC which means TSP was better.
  • Gradient_weight is a merit function and the value of HC is higher than TSP which means HC is better than TSP.
  • Gradient_raw is a merit function. The value of HC is again higher than TSP which implies HC is better than TSP.
##                            HC         TSP
## Path_length          121.9671    121.8256
## Gradient_weighted 156151.0604 107595.7915
## Gradient_raw       65528.0000  42136.0000

5. Parallel Coordinate

Here there are two prominent cluster which can be identified based on variable “iphone” having value greater than 56 or lesser than 56. Clusters can be colored using red and blue .

Visually the clusters becomes much clearer if variables are arranged in following order from left to right-Big Mac,Bread,iphone4S,Rice(in kg),Hours Worked,Food Cost,Clothing Index,Wage Net,Goods and Services , Vacation.

  • For cluster with red color, outlier can be seen at food cost over 900,clothing index at value lesser than 40.
  • For cluster with blue color, the outlier can been seen for bread variable at value 70.
pricesnearn$PhoneColor<-ifelse(pricesnearn$iPhone.4S.hr. > 56 ,1 ,0)

 p <- pricesnearn %>% plot_ly(width=1000,height=600,type = "parcoords",
                              line = list(color = ~PhoneColor,
                              colorscale = list(c(0,'red'),c(1,'blue'))),
                              dimensions = list(
                        list(label = "Food Cost", values =  ~ Food.Costs...),
                        list(label = "Iphone4S", values =  ~ iPhone.4S.hr.),
                        list(label = "Clothing Index", values =  ~ Clothing.Index),
                        list(label = "Hours Worked", values =  ~ Hours.Worked),
                        list(label = "Wage Net", values =  ~ Wage.Net),
                        list(label = "Vacation Days", values =  ~ Vacation.Days),
                        list(label = "Big Mac", values =  ~ Big.Mac.min.),
                        list(label = "Bread(in kg)", values =  ~ Bread.kg.in.min.),
                        list(label = "Rice(in kg)", values =  ~ Rice.kg.in.min.),
                        list(label = "Goods and Services", values = ~Goods.and.Services...)
                                  )) %>% layout(title="Parallel Plot")
 
 p

6. Radar chart Diagram

library(scales)
Ps=list()
scaledprices_euc <- as.data.frame(scaledprices_euc)
nPlot<- nrow(scaledprices_euc)

scaledprices_euc%>% add_rownames(var="City") -> scaledprices_radar

for ( i in 1:nPlot) {
    Ps[[i]]<-htmltools::tags$div(
        plot_ly(type="scatterpolar",
                r=as.numeric(scaledprices_radar[i,-1]),
                theta= colnames(scaledprices_radar)[-1],
                fill="toself") %>% 
              layout(title=scaledprices_radar$City[i]),style="width: 25%;") 
    
}

h<-htmltools::tags$div(style="display: flex; flex-wrap: wrap; ",Ps)

htmltools::browsable(h)

First cluster includes Tokyo, Seoul, Hongkong and Toronto. On stacking one above another, it can be seen that variables of Tokyo are not similar with respect to another country . Food cost, Clothing Index and Goods and services can be considered as outlier for Tokyo with respect to other country .

Second cluster includes Chicago ,Los Angeles, Miami abd Tel Aviv. “Rice in kg” variable for Tel Aviv can be considered as outlier.

 plot_ly(
     type = 'scatterpolar',
      fill="toself"
   )%>%add_trace(
                 r=as.numeric(scaledprices_corr["Tokyo",-11]),
                 theta= colnames(scaledprices_corr)[-11],
                 name="Tokyo") %>% 
                 add_trace(
                 r=as.numeric(scaledprices_corr["Seoul",-11]),
                 theta= colnames(scaledprices_corr)[-11],
                 name="Seoul") %>%
                 add_trace(
                 r=as.numeric(scaledprices_corr["Hong Kong",-11]),
                 theta= colnames(scaledprices_corr)[-11],
                name="Hong Kong") %>%
                add_trace(
                 r=as.numeric(scaledprices_corr["Toronto",-11]),
                 theta= colnames(scaledprices_corr)[-11],
                name="Toronto")
 plot_ly(
     type = 'scatterpolar',
      fill="toself"
   )%>%add_trace(
                 r=as.numeric(scaledprices_corr["Chicago",-11]),
                 theta= colnames(scaledprices_corr)[-11],
                 name="Chicago") %>% 
                 add_trace(
                 r=as.numeric(scaledprices_corr["Los Angeles",-11]),
                 theta= colnames(scaledprices_corr)[-11],
                 name="Los Angeles") %>%
                 add_trace(
                 r=as.numeric(scaledprices_corr["Miami",-11]),
                 theta= colnames(scaledprices_corr)[-11],
                name="Miami") %>%
                add_trace(
                 r=as.numeric(scaledprices_corr["Tel Aviv",-11]),
                 theta= colnames(scaledprices_corr)[-11],
                name="Tel Aviv")


###7. Plots for analyzing data Among heatmaps,parallel coordinate and radar charts, heatmaps seems to be best at analysing the similarity of variable and objects due to visible gradient in color. With parallel plots it becomes difficult to trace all lines and some times it might become difficult to see pattern. With Radar chart it might become difficult to judge orientation.Radar chart and parallel plots are good at identifying outliers.